Language Dynamics and Capitalization using Maximum Entropy

Authors

  • Fernando Batista
  • Nuno J. Mamede
  • Isabel Trancoso
Abstract

This paper studies how written language varies over time and how that variation affects the capitalization task. A discriminative approach based on maximum entropy models is proposed to perform capitalization while taking language change into consideration. The proposed method makes it possible to use large corpora for training. Evaluation is performed on newspaper corpora using different testing periods. The results reveal a strong relation between capitalization performance and the time elapsed between the training and testing data periods.
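
To make the idea of maximum entropy capitalization concrete, the following is a minimal sketch rather than the authors' implementation. It assumes a binary simplification (capitalize the first letter or not), in which case a maximum entropy model reduces to logistic regression. The feature set (current word, previous word), the toy corpus, and the names `features`, `train`, and `capitalize` are all hypothetical choices made for illustration; the abstract does not specify them.

```python
import math
from collections import defaultdict

# Toy training data (invented for illustration); case marks the gold label.
TOY_CORPUS = [
    ["the", "paper", "was", "written", "in", "Lisbon"],
    ["yesterday", "Isabel", "spoke", "in", "Porto"],
    ["the", "results", "came", "from", "Lisbon"],
    ["today", "Nuno", "met", "Isabel", "in", "Porto"],
]

def features(words, i):
    """Assumed feature set: lowercased current word and previous word."""
    prev = words[i - 1].lower() if i > 0 else "<s>"
    return ["w=" + words[i].lower(), "prev=" + prev]

def train(sentences, epochs=50, lr=0.5):
    """Fit a binary maximum entropy (logistic regression) model
    with plain stochastic gradient ascent on the log-likelihood."""
    examples = []
    for sent in sentences:
        for i, word in enumerate(sent):
            label = 1.0 if word[0].isupper() else 0.0
            examples.append((features(sent, i), label))
    weights = defaultdict(float)
    for _ in range(epochs):
        for feats, label in examples:
            score = sum(weights[f] for f in feats)
            prob = 1.0 / (1.0 + math.exp(-score))  # P(capitalized | features)
            for f in feats:
                weights[f] += lr * (label - prob)  # log-likelihood gradient
    return weights

def capitalize(weights, words):
    """Capitalize each word whose predicted probability exceeds 0.5."""
    out = []
    for i, word in enumerate(words):
        score = sum(weights.get(f, 0.0) for f in features(words, i))
        out.append(word.capitalize() if score > 0 else word)
    return out

model = train(TOY_CORPUS)
print(capitalize(model, ["we", "met", "isabel", "in", "lisbon"]))
```

Because the weights are learned from the training text alone, names that only become frequent after the training period have no positive evidence in the model. This is one way to picture why performance could depend on the time gap between training and testing data.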

Similar resources

The impact of language dynamics on the capitalization of broadcast news

This paper investigates the impact of language dynamics on the capitalization of transcriptions of broadcast news. Most of the capitalization information is provided by a large newspaper corpus. Three different speech corpora subsets, from different time periods, are used for evaluation, assessing the importance of available training data in nearby time periods. Results are provided both for ma...

ACL-08: HLT, 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

This paper studies the impact of written language variations and the way it affects the capitalization task over time. A discriminative approach, based on maximum entropy models, is proposed to perform capitalization, taking the language changes into consideration. The proposed method makes it possible to use large corpora for training. The evaluation is performed over newspaper corpora using d...

Recovering capitalization and punctuation marks for automatic speech recognition: Case study for Portuguese broadcast news

This work presents a study on recovering punctuation marks and capitalization information from European Portuguese broadcast news speech transcriptions. Both generative and discriminative approaches were tested for capitalization, using finite state transducers automatically built from language models, and maximum entropy models. Several resources were used, includi...

Temporal Issues and Recognition Errors on the Capitalization of Speech Transcriptions

This paper investigates the capitalization task over Broadcast News speech transcriptions. Most of the capitalization information is provided by two large newspaper corpora, and the spoken language model is produced by retraining the newspaper language models with spoken data. Three different corpora subsets from different time periods are used for evaluation, revealing the importance of availa...

Automatic Recovery of Punctuation Marks and Capitalization Information for Iberian Languages

This paper shows experimental results concerning automatic enrichment of the speech recognition output with punctuation marks and capitalization information. The two tasks are treated as two classification problems, using a maximum entropy modeling approach. The approach is language independent as reinforced by experiments performed on Portuguese and Spanish Broadcast News corpora. The discrimi...


Publication date: 2008